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Abstract — We propose a distributed algorithm for solving the 
optimization problem Basis Pursuit (BP). BP finds the least t\- 
norm solution of the underdetermined linear system Ax = b and 
is used, for example, in compressed sensing for reconstruction. 
Our algorithm solves BP on a distributed platform such as a 
sensor network, and is designed to minimize the communication 
between nodes. The algorithm only requires the network to be 
connected, has no notion of a central processing node, and 
no node has access to the entire matrix A at any time. We 
consider two scenarios in which either the columns or the rows 
of A are distributed among the compute nodes. Our algorithm, 
named D-ADMM, is a decentralized implementation of the 
alternating direction method of multipliers. We show through 
numerical simulation that our algorithm requires considerably 
less communications between the nodes than the state-of-the-art 
algorithms. 

Index Terms — Basis pursuit, distributed optimization, sensor 
networks, augmented Lagrangian 

I. Introduction 
Basis Pursuit (BP) is the convex optimization problem [ 1 ] 



minimize ||x||i 
subject to Ax = b, 



(BP) 



where the optimization variable is x 6 M. n , \\x\\i = \xi\ + 
■ ■ ■ + \x n \ is the l\ norm of the vector x, and A € M mx " 
is a matrix with more columns than rows: m < n. In words, 
BP seeks the "smallest" (in the l\ norm sense) solution of 
the underdetermined linear system Ax = b. To make sure 
that Ax — b has at least one solution, we require the following. 

Assumption 1. A is full rank. 

BP has recently attracted attention due to its ability to find 
the sparsest solution of a linear system under certain conditions 
(see 12, 0). In particular, BP is a convex relaxation of the 
combinatorial and nonconvex problem obtained by replacing 
the l\ norm in ( IBPb by the £q pseudonorm ||a;||o, which counts 
the number of nonzero elements of x. Note that the linear 
system Ax = b has a unique fc-sparse solution, i.e., a solution 
whose £q norm is k, if every set of 2k columns of A is linearly 
independent. 
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Figure 1. Row partition and column partition of A into P blocks. We assume 
there are P nodes and the pth node stores A p . In the row partition a block 
is a set of rows, while in the column partition a block is a set of columns. 



BP belongs to a set of optimization problems that has appli- 
cations in many areas of engineering. Examples include signal 
and image denoising and restoration (TJ, 0, compression, 
fitting and approximation of functions [4 |, channel estimation 
and coding [3 | and compressed sensing |5|, [6| (for more ap- 
plications see for example Q, 12 an d the references therein). 
In particular, in the recent field of compressed sensing, BP 
plays a key role in the reconstruction of a signal. 

Notice that Assumption Q] holds with probability one if 
the entries of A are independent and identically distributed 
(i.i.d.) and drawn from some (non-degenerate) probability 
distribution, as commonly seen in compressed sensing 0. 
Also in compressed sensing, there are several strategies to deal 
with noisy observations, i.e., when the observation vector b is 
corrupted with noise. These include solving variations of (IBPI i. 
namely BPDN [1] and LASSO 0. 

Problem statement and contribution. Consider a network 
(e.g., a sensor network) with P compute nodes, and partition 
the matrix A into P blocks. Our goal is to solve BP in a 
distributed way. By distributed we mean that there is no notion 
of a central processing node and that the pth node has only 
access to the block A p of A during the execution. 

We partition A into blocks in two different ways, which we 
call row partition and column partition, visualized in Figure Q] 
In the row partition, the block A p contains m p rows of A, 
which implies m%+- ■ -+mp = m. In the column partition, A p 
contains n p columns of A, which implies n\ + ■ ■ ■ +np = n. 

In summary: given a network, we solve BP in a distributed 
way, either in the row partition or in the column partition. 

For the network we only require: 

Assumption 2. The given network is connected and static. 

Connected means that for any two nodes there is a path 
connecting them. Static means that the network topology does 
not change over time. 

We propose an algorithm to solve this problem and show 
through extensive simulations that it improves over previous 
work (discussed below), by reducing the total number of 
communications to achieve a given solution accuracy. The 
number of communications in distributed algorithms is an 
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important measure of performance. For example, it is well 
known that communicating with the neighboring nodes is the 
most energy-consuming task for a wireless sensor [9|; as a 
consequence, many energy-aware algorithms and protocols for 
wireless sensor networks have been proposed ifTUll . An energy- 
aware algorithm minimizes the communications and/or allows 
the nodes to become idle for some time instants. On distributed 
supercomputing platforms, on the other hand, computation 
time is the main concern. In this case, the computational 
bottleneck is again the communication between the nodes, 
and thus algorithms requiring less communications have the 
potential of being faster. 

Before we discuss related work, we provide possible appli- 
cations of our algorithm. 

Application: row partition. Given a network of P inter- 
connected sensors, we try to capture an ultra-wide band but 
spectrally sparse signal, represented in vector form as a; € W 1 . 
For simplicity, we assume the pth sensor only stores one 
row rj of A, i.e., m = P. Each sensor only captures some 
time samples at a rate far below the Nyquist rate, using for 
example a random demodulator iPTD . 0. One can represent 
each measurement as the number b p . Under certain conditions 
(0, 0, [12 1), it is possible to recover x by solving (IBPb 
with A = [ri ■ ■ ■ rp] T and b = [bi ■ ■ ■ bp] 1 . Further details 
about the matrix A and the vector b can be found in [8|. Since 
each vector r p is associated with a sensor, this corresponds 
to our row partition case. This scenario applies, for example, 
to sparse event detection in wireless networks lfl"3l . and to 
distributed target localization in sensor networks fl4l . 

Application: column partition. The work |[T5l introduces a 
method of speeding up seismic forward modeling in geological 
applications. The goal is to find the Green's functions of some 
model of a portion of the earth's surface. Given a set of sources 
and a set of receivers, from the knowledge of both the emitted 
and the received signals, the Green's function of the model, 
represented by x, has to be found. The authors of [151 propose 
to solve this problem when all sources emit at the same time 
and the receivers capture a linear superposition of all signals. 
The approach is then to solve BP, where a set of columns of A 
is associated with a source. Note that a distributed solution 
makes sense because the sources are physically far apart. 

As another example for the column partition, we interpret 
BP as finding a sparse representation of a given signal b with 
respect to a dictionary of atomic signals (columns of A). It is 
common to assume that the dictionary (the matrix A) contains 
several families of functions, e.g., Fourier, DCT, wavelets, to 
become overcomplete. Suppose that we are given P proces- 
sors, each of which is tuned to perform computations for 
a certain family of functions. In this case, solving BP in a 
column partition framework would arise naturally. 

Algorithms for solving BP and related work. Since BP 
can be recast as a linear program (LP) [4|, any algorithm that 
solves LPs can also solve BP. Among the many algorithms 
solving LPs [16], most cannot be readily adapted to our 
distributed scenario. For example, the (distributed) simplex 
algorithm ifTTl . Ifl8] | can solve LPs only in complete networks, 
i.e., those with a link between any pair of nodes. In this paper, 
we aim to solve BP for every connected network topology. 



In recent years, some approaches have been proposed for 
solving general optimization problems, including BP, in dis- 
tributed networks. For example, [ 19 1 proposes a method based 
on subgradient algorithms, but these are known to converge 
very slowly. Other approaches to distributed optimization 
combine the method of multipliers (MM) with the nonlinear 
Gauss-Seidel (NGS) method or with Jacobi algorithms [20|. 
For example, ETI uses MM together with a Jacobi-type 
algorithm named diagonal quadratic approximation (DQA) to 
solve, in a distributed way, convex problems constrained by 
linear equations. Using a suitable reformulation of (IBPt . this 
method can be applied to our problem statement. In lt22l we 
analyzed how well MM together with NGS solves BP in the 
row partition scenario; and in [1231 we used a fast gradient 
algorithm in both loops. The algorithm we propose here has 
just one loop and requires considerably fewer iterations to 
converge than all the previous approaches. 

Fast algorithms solving BP in a non-distributed way in- 
clude spgll El, fpc J25), LARS (26), C-SALSA flU), and 
NESTA [ 28 1 . These are faster than distributed algorithms but 
require that A and b are available at the same location. In 
contrast, a distributed algorithm can solve problems that can 
only fit into the combined memory of all the nodes. 

The work [29 1 is closest related to ours. It solves the Basis 
Pursuit Denoising (BPDN) [l] (a noise-robust version of BP), 
which also produces sparse solutions of linear systems. The 
algorithm is called D-Lasso and can be adapted to solve our 
problem. Our simulations show that the algorithm we propose 
requires systematically less communications than D-Lasso. 

Our algorithm is based on the alternating direction method 
of multipliers (ADMM). The work [30] also uses ADMM in a 
distributed scenario, but is only applicable to networks where 
all the nodes connect to a central node. Our algorithm, in 
contrast, is designed for decentralized scenarios (no central 
node) and applies to any connected network. 

Our type of matrix partitioning has been considered be- 
fore in the context of distributed algorithms for linear pro- 
grams IfTTl . ifTHl and in regression of distributed data BP . 

II. Row Partition 
In this section we partition the matrix A by rows: 



A = 



where each block A v S 



A x 
A P 



m p xn con i; ams a subset of rows 
+ mp = m. The vector b is 



of A such that mi + • • 
partitioned similarly: b = [bj ■ ■ ■ bJ,] T . We assume that A p 
and bp are available only at the pth node of a connected 
network with P compute nodes. We model the network as an 
undirected graph Q = (V, £), where V = {1, 2, . . . , P} is the 
set of nodes and £ C V x V is the set of edges. We represent 
the edge connecting nodes i and j by {i,j} or {j, i}; E is 
the total number of edges. See Figure |2] for an example graph. 
If {i, j} is an edge, then node i and node j can exchange 
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messages with each other. The set of neighbors of node p is 
written as Af p , and its degree is D p = \Af p \. 




Figure 2. Example of a connected network with P = E = 7. The set of 
edges is S = {{1, 2}, {2, 3}, {2, 4}, {4, 5}, {4, 6}, {5, 6}, {5, 7}}. 

Graph coloring. We assume that a proper coloring C = 
{1,...,C} of the graph is available. This means that each 
node is labeled with a number c E C, which we call color, 
such that no adjacent nodes (i.e., neighbors) have the same 
color. The minimum number of colors required for a proper 
coloring of a graph Q is its chromatic number x(Q)- Coloring 
a graph with x(G) colors or just computing x{Q) lS NP- 
hard for x{Q) > 2 11321 . Several distributed algorithms for 
coloring a graph exist 03, El, 03, 0<D • For example, 03 
determines a coloring with 0(D max ) colors, where D max = 
maxpDp, using 0(D max / log (Z> max ) +log*(P)) iterations. 
If more colors are allowed, for example 0(D^ ax ), then 
0(log*(P)) iterations suffice ll36l . In this paper we assume 
that a proper coloring C with C colors is given. 

Problem reformulation. To solve BP in a distributed way 
we first rewrite {BP} to make the row partition explicit: 

minimize ||a:||i .j. 
subject to A p x = b p , p = 1, . . . , P . 

The variable x is coupling the problem. To decouple, we 
replace x with P copies of x. The pth copy is denoted with x p . 
To ensure the necessary global consistency condition X\ = 
X2 = ■ ■ ■ = xp, we enforce the equivalent (since the network 
is connected) constraint Xi — Xj for each edge of the 

network: 

minimize i Sp=i II x p 1 1 1 

subject to ApX p = b p , p=l,...,P (2) 

Xi Xj , } £ ^ ■ 

The optimization variable is x := (xx, . . . , xp) £ (R") p . Note 
that (O can be written more compactly as 

minimize i 2p=i II x p 1 1 1 

subject to A p Xp = b p , p=l,...,P (3) 

(B T (g) I n )x = , 

where I n is the nxn identity matrix, and ® is the Kronecker 
product. The matrix B is the P x E node-arc incidence 
matrix of the graph: each edge {i,j} € £ corresponds to a 
column in B with the ith and jth entries equal to 1 and — 1, 
respectively. 

Algorithm for bipartite graphs. We first consider a simple 
case: Q is bipartite and hence x(G) = 2- The generalization to 
any connected graph will be straightforward. Bipartite graphs 
include trees and grid graphs. 



Without loss of generality, assume nodes 1 to c have color 1 
and the remaining have color 2. Then, 0) can be written as 

minimize i YT P =i \\ x vh + 7 Ep =c +i \\ x p\\i 
subject to A p Xp ~b p , p = l,...,P (4) 

(Bj ® I n )xi + (Bj ® I n )x 2 = , 

where x = (xi,x 2 ) £ (R"| c x (W l ) p - C and B is par- 
titioned as B = \B± _Bj] . We propose the alternating 
direction method of multipliers (ADMM, briefly described in 
appendix [A} to solve (0). The augmented Lagrangian of (0), 
dualizing only the last constraint, is 

L(x 1 ,x 2 ;X) = — ^2 \\x P \\i + p ^2 \\xp\\i +<fa(xx,\) 

+ 02(^2, A) + pxJiBiBj ®I n )x 2 , (5) 
where C x = {1, . . . , c}, C 2 = {c + 1, . . . , P}, and 

(j)i{x h X) = \ T (Bj®I n )xi + ^\\(Bj ® I n )x t \\ 2 

= ((Bj <g> I n )X) T x t + ^xJiB.Bj (g) I n ) Xi , 

for i = 1,2. Note that, since nodes in each d are not neighbors 
between themselves, BiBj is diagonal (with D p in the pth 
diagonal entry). Hence, 

(pi(xi,X) = 5Z(7>p + fA>IM 2 ) . * = 1,2, (6) 
pec; 

where 7p := EjeA^, si 8 n (-?' ~ p) X {p,j} and si g n (^) g ives 1 
if w > and — 1 otherwise. We decomposed the dual 
variable A into (. . . , A/j^i, . . .), where A/^ 3 -\ = A/^^} is 
associated with the constraint x t = Xj. 

Equations (O and (0 show that minimizing L(xi,x 2 ;X) 
with respect to (w.r.t.) xi yields c optimization problems that 
can be executed in parallel; similarly, minimizing it w.r.t. 
x 2 yields P — c parallel optimization problems. Algorithm [T] 
shows the application of ADMM to our problem. We name 
our algorithm D-ADMM, after Distributed ADMM. 



Algorithm 1 D-ADMM for bipartite graphs 

Initialization: for all p G V, set 7P 1 ' = a;^ 1 ' = and k = 1 
1: repeat 

2: for all p 6 Ci [in parallel] do 

3: Set v ( p k) = j ( p k) - pE je AA p x j k) and find 

4 fe+1) = argmin + vi k)T x p + ^f\\x p \\ 2 

s.t. ApXp = 6p 

4: Send to A" p 

5: end for 

6: Repeat l2l5l for all p £ C2, replacing by x^ +1 ^ 

7: for all p e Ci U C 2 [in parallel] do 

(fc + l) (fe) , v-^ I + 

7p = 7p + P E jeA f p ( x p - ^ ) 
8: end for 
9: fc ^- fc + 1 

10: until some stopping criterion is met 



The optimization problem in step [3] results from minimizing 
the augmented Lagrangian L(xi, x 2 ; A) w.r.t. x p . To derive it, 
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note that © enables us to rewrite L(x\, x 2 ; A) as 



= 1 p&Ci 



L(x 1 ,x 2 ;X) = ^2^2 (pIM 1 +"tp x P + \ D v\\ x v 

+ pxi(BiBj (g) I n )x 2 ■ 



The (ij)th entry of B\Bj is —1 if {i,j} G £ and other- 
wise. Therefore, pxi(BiBj ® I n )x 2 = - pJ2{i,j}ee x l x j • 
Picking p E Ci for any i = 1,2 and minimizing L(aii,X2;A) 
w.r.t. x p yields the optimization problem in step [3] AppendixIBI 
describes an efficient method for solving this problem. 

AlgorithmQJshows that nodes with the same color operate in 
parallel, whereas nodes with different colors cannot. In other 
words, the nodes from C\ have to wait for the computation 
of the nodes from C 2 and vice-versa. However, at the end 
of each iteration, every node will have communicated once 
(sending x p k+1 ^ and receiving Xj k+1 ') with all its neighbors. 

Regarding the dual variable A, its components do not appear 
explicitly in Algorithm QJ The reason is that node p only 
requires -f p — J2j£j\f sig n (j — p)^>{p,j} f° r i ts optimization 
problem. According to the canonical form of ADMM, we have 
to update Xuj\, for each edge {i,j} £ £ as 



(fe+i) (fe+i)% 



(7) 



Inserting (FT) into the expression of j p we obtain the update 
of step [7] 

The following theorem establishes the convergence of Al- 
gorithm [TJ 

Theorem 1. Assume the given graph is bipartite. Then, for 
all p, the sequence {x p k f produced by Algorithm\l\converges 
to a solution of ( IBPb . 

Proof: We have already seen that when the graph is bipar- 
tite ( IBPb is equivalent to (T4]i. We now show that <j4j satisfies 
the conditions of Theorem |4] in appendix [A] Let fi(xi) = 
(1/P) J2cec II^pIU' f° r * = 1)2. Clearly, f% and f 2 are real- 
valued convex functions. Assumption [TJ on the rank of the 
matrix A implies that (IBPb . and thus (0), is always solvable. 
Also, the non-dualized equations A p x p = b p in © define 
polyhedral sets. 

Now we have to prove that the matrices Bj ® /„ and B J ® 
/„ have full column rank, i.e., that Bj and Bj have full col- 
umn rank. We have seen that B\Bj and B 2 Bj are diagonal 
matrices because the nodes within one class are not neighbors. 
Note that the pth entry of the diagonal of BiBj (or B 2 Bj) 
is the degree of the pth node. Due to Assumption [2] there are 
no isolated nodes and thus B\Bj and B 2 Bj are full -rank. 
The result then follows because rank (BB T ) = rank (B T ) for 
any matrix B. ■ 

Theorem [TJ also shows that after Algorithm [TJ terminates, 
every node will know a solution x* of BP. 

Algorithm for general graphs. We now generalize Algo- 
rifhm[TJto arbitrary graphs with x(G) > 2- The generalization 
is straightforward, but we cannot guarantee convergence as 
in Theorem [JJ However, in our extensive experiments, shown 
later, the resulting algorithm never failed to converge. 



Let Q be a graph with a proper coloring C and let C = \C\ 
be the number of colors. Let C c be the set of nodes that have 
color c, c = 1, . . . , C. Without loss of generality, suppose the 
nodes are numbered the following way: C% = {1, . . . , |Ci|}, 

C 2 = {|C 1 | + 1,...,|C 1 | + |C 2 |}, ...,c c = {Ef=i 1 |c c | + 

1, . . . , P}. This enables a partition of the matrix B as B = 
\Bj ■ ■ ■ Bq\ , making (0) equivalent to 



minimize 



p Sc=l S 



pec c 



\ x p\h 
= L. 



subject to A p x p = b p , p 

Ec=i(Bj ® In)x c = 



,P 



(8) 



where x — [x\, . . . ,Xc) is the variable, and x c g (R")l Cc l 
for c = 1, . . . , C. From the proof of Theorem [JJ we know 
that each matrix B c has full row rank. Thus, we can apply 
the generalized ADMM to solve ^ (see Appendix [A]). That 
leads to the following algorithm. 

Algorithm 2 D-ADMM for general graphs 
Initialization: for all p € V, set 7P 1 ' = a;^ 1 ' = and k = 1 



repeat 

for c = 1, . . . , C do 

for all p 6 C c [in parallel] do 



„(fc) _ Ak) 



and find 



3<V 



Jk+l) 



argmin ^||3;p||i + v p k) x p + ^-\\x p \\ 2 



Send x p k+1) to M v 
end for 
end for 

for all p — 1, . . . , P [in parallel] do 

end for 

k «- k + 1 
until some stopping criterion is met 



Algorithm [2] is a straightforward generalization of Algo- 
rithm [TJ Now there are C classes of nodes and all the nodes 
in one class "work" in parallel, but the classes cannot work at 
the same time. Consequently, if we consider the time to solve 
one instance of the problem in step [4] as one unit, one (outer) 
iteration in Algorithm [2] takes C units. 

In the bipartite case the coordination between the nodes 
was straightforward: node p only works after it has received Xj 
from all its neighbors. Here, according to the canonical format 
of Algorithm [2j all the nodes in one class should work at 
the same time. Since these nodes are not neighbors, neither 
there is a central node to coordinate them, in practice node p 
works after having received Xj k+1 ^'s from all its neighbors 
of lower color. An alternative way to see this is to transform 
the undirected graph of the network into a directed graph, as 
shown in Figure [3] The graph in Figure |2 [Ib)| is constructed 
from the graph in Figure [Ja)] by assigning a direction to each 
edge i —> j if the color of i is smaller than the color 
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of j, and i <— j otherwise. Then, each node only starts working 
after having received the x/s from all its inward links. In 
practice, this procedure can reduce the overall execution time 
since each node does not need to wait for its "color time." As 
described in step [5] (and in contrast to what Figure | 3jb)| may 
suggest), each node sends x^ +1 to all its neighbors in each 
iteration. 




(a) Undirected 




(b) Directed 

Figure 3. (a) undirected network with x{&) = 3 and with classes Ci = 
{1,2}, C2 = {3}, Cz = {4,5}; (b) directed graph constructed from (a) 
by assigning a direction to each link: from smallest color node to the largest 
color node. 

As stated earlier, we have no proof of convergence for 
Algorithm [2] only practical evidence. 

III. Column Partition 

In this section, we adapt the algorithm for the row partition 
to the column partition case: 



A = 



A x A 2 



A f 



Each block A p 6 K mxn p contains a subset of columns of A 6 

R mxn such that ni H h np = n. The block A p is only 

available at the pth node of an arbitrary connected network, 
and the vector b £ K m is known by all the nodes. 

Duality: pros and cons. In section [II] we saw an algorithm 
that solves BP with a row partition. Here, we want to reutilize 
that algorithm for BP with a column partition. The first 
approach to that is to consider the dual problem of ( IBPb : 

minimize b T X (9) 
subject to — 1„ < A T A < 1„ , 

where the dual variable is A e IR m , and 1„ 6 R™ is the 
vector of all ones. For a derivation of (O, see for example [37 
§1.3.3]. The matrix A now appears in the constraints of (0 
as A 1 , i.e., we can partition the constraint matrix in © by 
rows. The problem is that there is no straightforward way to 
recover a solution of (1BPI) from a solution of ©. Hence we 
need an alternative approach. 



Regularizing BP. Consider the following regularized ver- 
sion of (IBPI l: 



minimize 1 1 x \ \ 1 + | 
subject to Ax = b , 



(10) 



where 5 is a small positive number. While ( IBPt may have 
multiple solutions, ( [Tol l just has one, due to the strict convexity 
of its objective. When S is small enough, ( TTOb selects the least 
^2-norm solution of (1BPI) : 



Theorem 2. There exists 5 > such that the solution of ( 1101 ) 
is a solution of (IBPI) for all < 6 < S. 

The proof of this theorem is based on exact regularization 
results for linear programming [38], ||39l . To prove it, re- 
cast ( IBPI ) as a linear program (T], regularize it, and then rewrite 
the resulting problem as ( fTOb . Consequently, we recover a 
solution of (IBPb if ( fTOb is solved for a sufficiently small S. 
The benefit of solving dTOb is that it is immediate to recover 
the solution of ( fTOb from its dual solution. We are unaware 
of any strategy for choosing 6 without first solving dBPb . We 
will thus adopt a trial-and-error strategy. 

Dual problem. We use duality because the dual problem 
of ( fTOb will have terms involving A J . Since A is partitioned 
by columns, A T will be partitioned by rows. Therefore, the 
algorithm for the row partition will be applicable with some 
minor modifications. 

The dual problem of dTOb is 



maximize 

y 



L(y) 



(ID 



where the dual function is L(y) = —b T y + mf x (\\x\\ 1 + 
(v4 T ?/) T a;+|||a;|| 2 ), and y e M" 1 the dual variable. To keep the 
notation consistent with the previous section, we recast ( ITTb 
as a minimization problem: 



minimize 

y 



b T y + 



where 



= -inf (Harll! + (A T y) T x + (S/2)\\x\\ 2 ) 



(12) 



(13) 



The objective of the inner optimization problem of dTJb has 
a unique minimizer for each y, since it is strictly convex. 
Let x(y) denote the solution of this problem, for a fixed y. 
Strong duality holds for ( fTOb because its objective is convex 
and its constraints linear |4, §5.2.3], B51 prop. 5.2.1]. There- 
fore, after we find a solution y* to the dual problem (fl2l) . a 
(primal) solution of ( fTOb is available as x(y*). This follows 
directly from the KKT conditions 01 §5.5], E01 prop.5.1.5], 
and we express it in the following theorem. 

Theorem 3. Let y* solve (fTTT) . Then, x{y*) solves ( flOl ). 

Adapting the algorithm. Now we focus on solving (fl2l) . 
Let x be partitioned analogous to A, i.e., x = (xi, . . . , Xp), 
where x v € W lp . Note that $f(y) can be decomposed as the 
sum of P functions: *&(y) = ^i(y) + ■ ■ ■ + *&p(y) , where 



= -inf INplli + {Apy) T x p + -\\x p \\ 



(14) 
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can only be computed at node p because A p is only known 
there. We can then rewrite (TT~2T > as 

minimize J2 p =i{jsb T y + *p(j/)) 

y 

Notice that ^ p (y) can be easily computed at node p, since the 
optimization problem defining it has a closed form solution. 
We now apply the same procedure as in section [II] we clone 
the variable y into several y p 's, and constrain the problem 
with yi = yj, for all {i,j} £ £. This yields 

minimize Ep = i(p iT ^ + *?fc)) (15) 
subject to (B T <X> I n )y = , 

where the variable is y = (y%, . . . ,yp) £ (R") p . Note the 
similarity between ( TT3T > and J3). Having a proper coloring of 
the graph, the generalized ADMM is applicable: 



Algorithm 3 D-ADMM for general graphs (column partition) 

Initialization: for all p € V, set jjP = x p ^ — and k = 1 
1: repeat 

2: for c = 1, . . . , C do 

3: for all p £ C c [in parallel] do 

4 fc) = 7r-pE*f +1, -pE*f 

4: and find 

y< fc+1) = argmin* p ( yp ) + («« + l b ) T y p + ^f\\y P f 

5: Send y p k+1) to M p 

6: end for 

7: end for 

8: for all p = 1, . . . , P [in parallel] do 

9: end for 

10: fc <- fc + 1 

11: until some stopping criterion is met 



Algorithm [3] is similar to Algorithm [2] except for some 
minor modifications: the size of the variable to be transmitted 
is smaller (instead of transmitting x p £ W l , now the nodes 
transmit y p £ M m ), and the optimization problem to be solved 
at each node (see step [4]i is slightly different. Since that 
problem is unconstrained and its objective is differentiable, we 
can solve it directly with the Barzilai-Borwein algorithm [41 1 
(see appendix |B"1 for more details). 

Another difference to Algorithm [2] is that after the algorithm 
finished (finding an optimal vector y*), node p will not know 
the entire solution x(y*) to (11 0t . but only a portion of it, 
x p (y*), as the solution to the optimization problem defin- 
ing in ([l4l i. In case we want the entire solution x(y*) to be 
available in all nodes, just a few additional communications are 
required because x(y*) is expected to be sparse; furthermore, a 
spanning tree can be used to spread the x p 's over the network. 

We remark that if the graph is bipartite, then Algorithm [3] 
is proven to converge to an optimal solution of ( fTOb and, if 8 
is small enough, to a solution of dBPl I, An important issue 



is the possible ill-conditioning provoked by a small value 
of 5. In fact, a very small value for 5 may lead to difficulties 
in finding y p k+1 ^ in step [4] Note that this is the only step 
where 5 appears. In our simulations, explained in section [V] 
we used S = 10~ 3 and this value allowed us to compute 
solutions to BP with a very large precision, without incurring 
into numerical problems. 



IV. Other Algorithms 

In this section we overview other methods that solve BP in 
a truly distributed way. We only cover the row partition case 
because corresponding algorithms for the column partition can 
always be derived as shown in the previous section. 

We divide the algorithms into two categories according 
to the number of (nested) loops they have: single-looped 
and double-looped. D-ADMM is single-looped and, in each 
iteration, every node transmits a vector of size n to its 
neighbors. 

Performance measure: communication steps. We say that 
a communication step has occurred after all the nodes finish 
communicating their current estimates to their neighbors. All 
single-looped algorithms have one communication step per 
iteration. The double-looped algorithms have one communica- 
tion step per iteration of the inner loop. In all algorithms, the 
size of the transmitted vector is n. Another feature common 
to all algorithms is that in every iteration (or in every inner 
iteration, for the double-looped algorithms) each node has to 
solve the optimization problem in step |4] of Algorithm [2] (or 
Algorithm [3] for the column partition). This means that the 
algorithms have a common ground for comparison: if each 
iteration (or inner iteration, for the double-looped algorithms) 
involves one communication step and all the nodes have to 
solve a similar optimization problem (same format, same di- 
mensions, but possibly different parameters), then the number 
of iterations (or the sum of inner iterations) becomes a natural 
metric to compare the algorithms. We will then compare the 
algorithms by their number of communication steps, which 
is equal to the number of iterations in the single-looped 
algorithms and to the sum of inner iterations in the double- 
looped algorithms. Note that less communication steps can be 
expected to produce significant energy savings in scenarios 
such as sensor networks (9). 

Although data is transmitted in every communication step, 
the quantity of the transmitted data might actually decrease 
with the iterations. The reason is because the solution to BP 
is sparse and, at some point, the nodes' estimates start being 
sparse, allowing a possible compression of the transmitted data 
(e.g., just transmit the nonzero entries). 

We start with describing the single-looped algorithms. 

Subgradient. Nedic and Ozdaglar were the first to pro- 
pose a subgradient-based algorithm to solve general convex 
optimization problems in a completely distributed way [42|. 
However, they only addressed unconstrained optimization 
problems, which is not our case. Instead, we will use the 
method proposed in [19|, which generalizes l42l to problems 
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with private constraints in each node. That is, (T9) solves Algorithm 4 D-Lasso (node p) 



minimize E P =i fp( x ) 
subject to x G C\ p=1 X p , 



Initialization: for all p 6 V, set 7 P 



and k - 



repeat 

for all p = 1, 



set u 



„0+i) 



(fc) 



: 7p 



(*0 



, P [in parallel] do 



PY. ] eM p u{ P } x f ) and find 



argmin i ||x p ||i + t> P x p + p-D P ||a;,, 



where each / p is convex and each AT p is a closed convex 
set. This method combines consensus algorithms |43| with 
subgradient algorithms 1 4-0 , Ch.6], and for each node p, it 
takes the form 



Jk+l) _ 
L P ~ 



r (k) T (k) 
^pp ""p 



E 



lk)(k) 



(16) 



Send Xp +1 ' to A/p, and receive 



„(M-1) 



€ AC, 



end for 

for all p = 1, . . . , P [in parallel] do 

7p fc+1) =7p fe) +pE jS ^(4 fc+1) -^ +1) ) 
end for 

k <r- k + 1 

until some stopping criterion is met 



where cy are positive weights such that E, c ij ' = Ej c ij 
1 . the sequence {aW > : fc = 1,2,...} is square summable 
but not summable, and [p]j£ is the projection of the point p 
onto the set X: [p]j^ = argmin x {4||x — p|| 2 : a; G X}. 
The vector g p is a subgradient of f p at the point c p k p 'x p — 

E 



We apply ( fT6] l directly to problem (fl}, where we see ||x||i 
as ||x||i = -pIMIi + ••• + in other words, we set 

f P (x) = We choose = l/(fc + 1) for the 

step-size sequence. In our case, since the network is static 
(Assumption [2]), the weights c,j are constant: for every p, 
c P i = l/(-Dp + 1) f° r i € Af p U {p}, and otherwise. The 

implementation of ( fToT ) in a network is now straightforward: 

(fc) (fc) 
first, node p transmits x p to its neighbors and receives x^ 

from them; then, it updates its variable with ( [ToT i. These two 

steps are repeated until convergence. 

While ( fToT ) is proven to be robust to link failures, its 
convergence speed is too slow in practice. 

D-Lasso. As mentioned in section U Bazerque and Gi- 
annakis l29| proposed a distributed algorithm that solves a 
problem similar to ours. Here, we adapt it to solve BP. The 
starting point is problem (0, which by introducing a new 
variable Zij for each edge {i,j} £ £, is reformulated as 



minimize 
subject to 



p Z^p= 



x i 



z. 



z ij ! 



'p, p = l,-- 
{i,j} g 



(17) 



This problem is solved with ADMM by dualizing its last two 
constraints. We consider the problem partitioned in terms of 



the variable z = (..., z^, . . .) and x — (. 



Up, . 



.). In short, 



ADMM minimizes the augmented Lagrangian of dTTb w.r.t. z 
and then minimizes it w.r.t. x, using the new value of z. The 
minimization w.r.t. z has a closed form solution. After some 
manipulations, the algorithm for an arbitrary node p is: 



Although D-Lasso and D-ADMM (Algorithm O have a 
similar format, they are different. For example, D-Lasso is 
synchronous and D-ADMM asynchronous, and the parameters 
of the optimization problem each node solves are different in 
both algorithms. Also, D-ADMM is proven to converge for 
bipartite graphs only, while D-Lasso is proven to converge for 
any connected graph. In the next section, we will see that, in 
practice, D-ADMM converges in less iterations than D-Lasso, 
despite their common underlying algorithm. 

We now move to the double-looped algorithms. 

Double-looped algorithms. All double-looped algorithms 
we will see have the same theoretical foundation, but use 
different subalgorithms. Namely, all solve the following dual 
problem of (O: 

maximize L(X) 
A 



(18) 



where L(X) is the augmented dual function 

L(X)= inf Ep=ipll a; plli + E{ lJ } e e^A 



(x-i Xj ) 



P=L 



,P, 



(19) 

where 4>\{z) = A T z + ^\\z\\ 2 , and p is a positive parameter. 
The algorithms have an outer loop that solves ( fT8l . and an 
inner loop that solves the optimization problem in $1% . 

We consider three distributed, double-looped 
algorithms (US, EQ, (23] to solve CD, and thus © 
because strong duality holds. While ll22l . [23] were designed 
to solve BP, OTI was designed to solve more general 
problems. We thus have to adapt the latter to our problem. 
The algorithms described in l22l . ||2~T1 . Il23l will be denoted 
respectively by MM/NGS (method of multipliers and 
nonlinear Gauss-Seidel), MM/DQA (method of multipliers 
and diagonal quadratic approximation), and DN (double 
Nesterov). 

All algorithms solve ( fT8l with an iterative scheme in the 
outer loop. As in D-ADMM, the dual variable A consists of 
several variables Xuj\ associated with the edges E £■ 

It can be shown that the dual function L(X) in $1% is differen- 
tiable and that its gradient Vi(A) = (..., Xj(A) — Xj(X), . . .) 
is Lipschitz continuous with constant 1/p iPBl . The vec- 
tor x(A) := (xi (A), xi{ A), xp( A)) solves the optimization 



s 



problem in ([T9l i for a fixed A. The algorithm for solving this 
inner problem will be the inner loop and is considered later. 
These nice properties of L(X) enable the edge-wise application 
of the gradient method J40J §1.2] 

or the edge-wise application of Nesterov's method B31 

(fc+l) _ ,(fc+l) fe_l a (fc+l) _ ,(fc) X 

to solve ( [T8l . Nesterov's method is proven to be faster than 
the gradient method. When we use the gradient method (f20]l 
to solve a dual problem, where duality here is seen in the 
augmented Lagrangian sense, the resulting algorithm is called 
method of multipliers (MM) ||40| p.408]. While MM/NGS and 
MM/DQA use MM for their outer loop, DN uses d2"TV 

So far, we assumed that a solution of the optimization prob- 
lem in (119) . for a given A, was available. Nevertheless, solving 
this problem in a distributed way is more challenging than 
solving ( fT8l (when V£(A) is readily available). The reason 
is that we cannot decouple the term Ylu j}e£ ^ x {i n i Xi ~ 
Xj) into a sum of P functions, each one depending only 
on x p . Both MM/NGS and MM/DQA use an iterative method 
that optimizes the objective of (fT9b w.r.t. one block vari- 
able x p , while keeping the other blocks fixed. More con- 
cretely, let g\(xi, . . . , xp) denote the objective of ( fT9] l when A 
is fixed. MM/NGS uses the nonlinear Gauss-Seidel (NGS) 
method ESI §3.3.5]|46|: 

(t+i) . , (t) (t) (*)■> 

x{ = arg mm g\(x l: x 2 ', x\ ', . . . , x P ') 

(t+1) . , (i+1) (t) (ih 

x 2 = arg mm g\(x\ ',x 2 , x^ ', . . . ,xy ) 
x 2 ex 2 

(22) 

(t+l) . / (t + l) (t + l) (t+l) X 

Xp = arg mm g\{x\ \x\ ,x\ x P ), 

xp£X P 

where X p := {x p : A p x p — bp], p = 1, . . . , P. It can be 
proven that any limit point of the sequence generated by d22l 
solves problem ( fT9l ); see [46], [37|. Each optimization problem 
in d22l is solved at one node. It turns out that these are 
equivalent to the problem in step [4] of Algorithm [2] Note 
that the nodes in (1221 cannot operate in parallel, akin to the 
algorithm we propose here. MM/DQA, on the other hand, 
solves the problem in (fl~9b with a parallel scheme called 
diagonal quadratic approximation (DQA): 

i (*) (*) (*)\ 
u\ = arg mm g\(xi, x\ ' ,x\ . . . , x p ') 

u 2 =arg mm g\(x\' , x 2 , x y 3 . . . , x p ') 

: (23) 

/ (t) (t) (t) \ 
up = arg mm g\(x\ ,x 2 ,x 3 , . . . ,xp) 

xp£Xp 

X^ = TUp + (1 - T) X f , p=l,...,P, 

where r = 1/P. For a proof that d23l solves (fT9l > see [21 1, 
ll37l . The difference between (1221 and (l23l is that the latter 
allows all the nodes to operate in parallel, and after the 



minimization step, each node combines the solution of the 
optimization problem it has just solved with the previous 
estimate of the solution: Xp . Note that a communication step 
has to occur after each iteration. 

Regarding DN, we made some modifications to the inner 
loop of the method proposed in ll23l . so that we could get an 
algorithm comparable with what we propose here. 

Double Nesterov (DN). In fl23l . BP is recast as a linear 
program by increasing the size of the variable to 2n. The 
result is that the problem defining the dual function has a 
differentiable objective with a Lipschitz continuous gradient, 
and thus Nesterov's method is directly applicable. However, 
the size of the variable transmitted in each communication step 
is 2n. Here, we do not recast BP as an LP. As seen before, the 
dual problem (TT8T l is solved with Nesterov's method d2ll in 
the outer loop. Now, to solve the optimization problem in ( fT9l ), 
Nesterov's method is not applicable because the objective is 
not differentiable. However, that objective can be written as 
the sum of a nondifferentiable function h(x) = J2p=i ~p\\ x p\\ i 
with a differentiable one g(x) — J2{i j}e£ « ( Xi ~ x i)- 
The gradient of g(x) w.r.t. x p is V x g(x) = 7 P + pD p x p — 
P J2jeN x o ■ Therefore, to compute V ' x v g{x), each node needs 
only to communicate with its neighbors. The gradient Vg{x) is 
Lipschitz continuous with constant /oA max (£), where A max (£) 
denotes the maximum eigenvalue of the graph Laplacian. 
FISTA ll47l is an algorithm that adapts Nesterov's method to 
this scenario. It operates the following way: 

Algorithm 5 FISTA (for node p) 

Initialization: choose a = l/(pA ma x(£)), x ( p ^ = j/p ' = 0, t = 
1: repeat 

2: up = - aVg(y£') 

3: x p t+1) = argmm Ip ^\\x p - u p \\ 2 + h(x p ) 

A. „(fc+l) _ _(*+!) , fc-1 _ T W\ 

yp — x p t k+2 ij, p x p i 

5: k 4- k + 1 

6: until some stopping criterion is met 



This modification to [23 1 allows us to compare the resulting 
algorithm with ours, because the size of the variable is now n. 
Furthermore, the problem in step [3] is equivalent to the one in 
step g] of Algorithm |2] 

Tuning parameter p. Note that all algorithms (except the 
subgradient) share the same tuning parameter p, because all 
are based on an augmented Lagrangian reformulation. It is 
known that p influences the convergence rate of augmented 
Lagrangian methods. Albeit there are self-adaptive schemes 
to update p during the algorithm [30 1, [48 1, [49], making the 
algorithms less sensitive to p, we were not able to implement 
these schemes in a distributed scenario. We will hence as- 
sume p is constant during the execution of the algorithms. 

Execution times in wireless networks. In contrast with all 
the algorithms described here (except MM/NGS), D-ADMM 
assumes a coloring scheme based on which the nodes operate 
asynchronously. Suppose all the algorithms are implemented 
on an ideal network, where packet collisions do not occur, 
i.e., two neigboring nodes can transmit messages at the same 
time without causing interference at the reception. If a com- 
munication step by D-ADMM takes T time units, then a 
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Table I 

Algorithms for comparison in the simulations. 



Acronym 


Algorithm(s) 






Source 


D-ADMM 


Alternating direction MM 




This paper 


Subgradient 


Subgradient method 






iiqi 


D-Lasso 


Alternating direction MM 






|29 


MM/NGS 


MM + nonlinear Gauss-Seidel 






[22] 


MM/DQA 


MM + diagonal quadratic approximation 




ED 


DN 


Nesterov + Nesterov 






(23] 




Table II 








Scenarios for row partition experiments. 






Scenario 


Sparco Id m 


n 


P 




1 


500 


2000 


50 




2 


7 600 


2560 


50 




3 


3 1024 


2048 


64 




4 


902 200 


1000 


50 




5 


11 256 


1024 


64 





communication step by the other algorithms takes T/C units, 
where C is the number of colors we used for the network 
(we are ignoring the optimizations that can be made from 
the procedure described in Figure [3]). Therefore, although D- 
ADMM requires less communication steps, as shown next, 
it might actually take longer than competing algorithms. 
However, in a real wireless network, packet collisions occur 
and medium-access (MAC) protocols have to be implemented 
to avoid them. Hence, synchronous algorithms cannot operate 
synchronously in wireless networks. The execution time of an 
algorithm, among other factors, is highly dependent on the 
MAC protocol. Comparing execution times is thus beyond the 
scope of this paper. 

V. Experimental Results 

In this section we compare our algorithm against the prior 
work discussed in the previous section and listed in Table U 
We focus on the row-partitioned case since the algorithm for 
the column partition is derived from it. We start describing 
how the data and the networks were generated, and how the 
experiments were carried out. In the first type of experiments 
we compare all the algorithms on moderate-sized networks 
(around 50 nodes) and conclude that D-ADMM and D-Lasso 
are the "fastest" algorithms. In the second type of experiments 
we compare only these two algorithms in a more thorough way 
for the same networks, and we also see how their performance 
varies as the network size increases (from 2 nodes to 1024 
nodes). Finally, we address the column partition case. 

Experimental setup. We considered five distinct scenarios 
with different dimensions and different types of data, shown 
in Table M The data (matrix A £ R mxn and vector 6 e R m ) 
was taken from the Sparco toolbox l50l . except in scenario 1, 
where we used a 500 x 2000 matrix with i.i.d. Gaussian entries 
with zero mean and variance 1/y/m. In each scenario, each 
node stores m p — m/P rows of A. We ensured that m p — 



Table III 

Network models for the experiments. 



Network number 


Model 


Parameters 


1 


Erdos-Renyi 


p = 0.25 


2 


Erdos-Renyi 


p = 0.75 


3 


Watts-Strogatz 


(n,p) = (4,0.6) 


4 


Watts-Strogatz 


(n,p) = (2,0.8) 


5 


Barabasi-Albert 




6 


Geometric 


d = 0.75 


7 


Lattice 





m/P is an integer by considering two values for P: 50 and 64, 
chosen depending on the scenario. 

In the following, x* denotes the solution of BP obtained 
by the Sparco toolbox, or in scenario 1, the one obtained by 
CVX [51], solving BP as a linear program. Note that due to 
the dimensions of the matrices and their randomness/structure, 
x* is guaranteed to be unique with overwhelming probability. 

For each scenario we ran all algorithms for the seven 
different networks shown in Table [III] For each network in 
Table iHll we generated two networks: one with 50 nodes (used 
in scenarios with P — 50), the other with 64 nodes (used in 
scenarios with P = 64). The parameters of the networks were 
chosen so that the generated network would be connected with 
high probability. Only for network 4, P = 50 we did not get 
a connected network at first, so we changed the parameters 
to (3, 0.8). If the generated network had self-connections or 
multiple edges between the same pair of nodes, we would 
remove them. We also generated 10 networks with 2* nodes 
(i = 1, .. . , 10), all following the model of network 3. These 
are used in the type II experiments (explained below). 

The Erdos-Renyi model [52] has one parameter p, which 
specifies the probability of any two nodes in the network being 
connected. The Watts-Strogatz model [53] has two parameters: 
the number of neighbors n and the rewiring probability p. First 
it creates a lattice where every node is connected with n other 
nodes; then, every link is rewired, or not, with probability p. 
If a rewiring occurs in link then we pick node i or j 

(with equal probability) and connect it with other node in the 
network, chosen uniformly. The Barabasi-Albert model [54] 
starts with one node; at each step, one node is added to 
the network and is connected to one of the nodes already 
in the network. However, the probability of the new node 
"choosing" to connect to the other nodes is not uniform: it 
is proportional to the nodes' degrees such that the new node 
has a greater probability of connecting to the nodes with larger 
degrees. The geometric model [55] deploys P nodes randomly 
(uniformly) in the unit square; then, two nodes are connected 
if their distance is less than d. Finally, the Lattice model has 
no randomness. For P nodes, it generates a rectangular grid 
graph in the plane such that the shape is as square as possible 
(5 x 10 for P = 50 and 8 x 8 for P = 64). Each node has 
four neighbors except for the borders. This lattice network is 
the only one guaranteed to be bipartite, and thus Algorithm [2] 
is only guaranteed to converge for this network. 

We used an heuristic from the Matgraph toolbox ll56ll to 
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Table IV 




Types of experiments. 


Type of experiment 


Value of p 


I 


p = 1 for D-ADMM and D-Lasso 




p = 10 for MM/NGS, MM/DQA, and DN 


II 


p e {io -3 , icr 2 , io _1 , 10°, io 1 } 




the value that leads to the best results is picked 



find a coloring for these networks. It is then possible that the 
number of colors is larger than x(G) ■ We checked that the 
optimal solution of two colors was found for the Lattice model. 

Results. As mentioned before, we keep the parameter p 
fixed during the execution of the algorithms (except for the 
subgradient method, which has no p). We picked p in two 
different ways, yielding two types of experiments, shown in 
Table [IV] In type I, p was always the same for all scenarios 
and all networks: p = 1 for D-ADMM (Algorithm 13 and 
for D-Lasso (Algorithm H), and p = 10 or the double-looped 
algorithms MM/NGS, MM/DQA, and DN. These values were 
chosen based on some pre-testing. In the type II experiments, 
given a fixed scenario and network, we execute each algorithm 
for several p's and pick the one that yields the best result. 
For the type II experiments, we only considered the best two 
algorithms: D-ADMM and D-Lasso. 

The two types of experiments reflect two different philoso- 
phies in the assessment of algorithms that depend on parame- 
ters: type I represents real-world applications (the parameters 
are tuned for known data and are then used unchanged); type II 
is more suited to assess the true capabilities of the algorithm. 

Type I experiments. Figure shows the results of the type I 
experiments. The left-hand (resp. right-hand) side plots show, 
for each network, the number of communication steps until 
each algorithm achieves a precision of 1% (resp. 10~ 3 %) at a 
randomly selected node p. This means we count the number of 
communication steps until — < 10~ 2 or 10~ 5 . 

We allowed a maximum number of 10 4 communication steps. 

In Figure we observe that the behavior of the al- 
gorithms in all scenarios, except in scenario 3, is iden- 
tical, so we will focus only on scenarios 1 and 3. Fig- 
ure |3??SubFig:RPExperimentsScenl) shows that, for sce- 
nario 1, D-ADMM requires the least number of communi- 
cations to achieve both accuracies regardless the network. 
We can also see that for this scenario MM/DQA, DN, and 
Subgradient always reached the maximum number of 10 4 
iterations before achieving any of the prescribed accuracies. 
As stated before, the behavior of the algorithms for the 
remaining scenarios (except scenario 3) is very similar. In 
scenario 3, Figure |5l??SubFig:RPExperimentsScen3), we see 
a different behavior: while D-ADMM required less com- 
munications than any of the p-dependent algorithms, the 
Subgradient required less communications to achieve the ac- 
curacy 1% for networks 1, 2, and 6. However, if we let 
the algorithms continue executing, the Subgradient reaches 
the maximum number of communications before achieving 
the 10 _3 % of accuracy, as can be seen in the right-hand 



plot of Figure |3??SubFig:RPExperimentsScen3). Note that 
the relative behavior of the remaining algorithms is roughly 
the same for both accuracies. 

In Figure [6] we show how the error of the estimate x. p at a 
random node p varies along the iterations, for each algorithm. 
Figure 0fa)l shows the error for scenario 1 when the algorithms 
are executed in network number 4. Notice that the number 
of communications to achieve accuracies of 1% and 10 _3 % 
agree with the plots of Figure | ^Ja)| for example D-ADMM 
takes less than 10 3 communication steps to achieve a 10 -5 
precision. Figure | ^Jb)| shows the errors for scenario 3 when 
we use network 3 (cf. with the plots of Figure | 3Ic)[ ). Note 
the similarity of the curves of D-ADMM and D-Lasso: they 
have the same shape but the D-ADMM error is always smaller. 
This might happen because both methods use the same internal 
algorithm, albeit applied to different reformulations. Finally, 
note in Figure | ^b)| how the error of the Subgradient evolves 
for scenario 3, network 3: the rate of convergence is very fast 
at the beginning, but after the first 1000 iterations it becomes 
very slow. This agrees with what was observed in Figure | ^Jc)| 

Type II experiments. For the type II experiments we only 
considered the two best algorithms: D-ADMM and D-Lasso. 
Figure [7] shows for each network the number of communi- 
cation steps to reach an accuracy of 10 _3 %. We allowed 
for maximally 3000 communication steps (these were only 
achieved by D-Lasso in scenario 3 for networks 3, 4, and 5, 
as can be seen in Figure | 2tb)[ ). We observed that the best 
values of p for D-ADMM were always 10~ 2 , 10 _1 , or 1. 
For example, D-ADMM had the best performance for p = 1 
for scenarios 1, 3, and 5 when the networks were either 5 
or 7. For instance, for scenario 1, network 5 D-ADMM 
took 462 communication steps (see Figure | 2fa)[ ), the same 
number observed in the type I experiments, in right-hand plot 
of Figure | 3Ja)| Recall that p was fixed at 1 for D-ADMM in 
the type I experiments. This also means that in the type II 
experiments the number of communications for D-ADMM 
decreased except for scenarios 1, 3, and 5 when the networks 
were either 5 or 7. The same phenomenon happened for D- 
Lasso: the optimal p was 1 only in scenarios 1 and 5 for the 
5th network; and the optimal p's were 10~ 2 , 1Q — 1 , or 1. 

We conclude from Figure [7] that D-ADMM requires less 
communication steps than D-Lasso, independently of the sce- 
nario or network type. Excluding the cases D-Lasso reached 
the maximum number of iterations, we see that in average D- 
ADMM uses 51% of D-Lasso's number of communications 
(11% of standard deviation). The largest difference occurred 
in scenario 3, network 6, where D-ADMM used 35% of 
the communications D-Lasso used; this number was 78% for 
scenario 4, network 1, the smallest difference that occurred. 

Figure [8] shows another type II experiment: we fixed the 
scenario and network type: Scenario 3, Watts-Strogatz with 
parameters (4, 0.6); and observed how the number of commu- 
nication steps varies as the size of the network increases. The 
number of nodes varied from 2 (each node stores 512 rows) 
to 1024 (each node stores 1 row) and was always a power 
of 2. D-ADMM and D-Lasso stopped after reaching 0.1% of 
accuracy. As shown by the gray straight lines in Figure [H] 
the communication steps in both algorithms increases approx- 
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Figure 4. Type I experiments: number of communication steps to reach accuracies of 1% and 10 3 % as a function of the network (see Table ITTlV 
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Figure 6. Type I experiments: errors along the iterations (communication steps) of the algorithms for fixed scenarios and networks. 



imately linearly in a log-log plot. The model we used to 
compute those lines was log 10 C — a log 2 P + (3, where C is 
the number of communication steps, P the number of nodes, 
and a and f3 are the parameters to be found for each line. The 
minimum least squares error yielded (a,/3) = (0.243,1.07) 
for D-ADMM and (a, 0) = (0.233, 1.47) for D-Lasso. There- 
fore, C ~ 11.7- P o s for D-ADMM and C ~ 29.5 • P 77 for 
D-Lasso, showing a less-than-linear increase of the commu- 
nication steps with the number of nodes, for both algorithms. 
Also, the difference between the lines' offsets reveals that D- 
Lasso took in average 2.5 times more communications than D- 
ADMM. The average number of colors was 4.6, which means 



that in a collision-free network D-ADMM would be 1.8 times 
slower than D-Lasso. Again, the optimal p's were 10~ 2 , 10~\ 
or 1, but we noticed a curious pattern on both algorithms: 
the optimal value for p decreased as the size of the network 
increased. 

Results for the column partition. For the column partition 
we only executed type II experiments. While the scenarios 
were the same as before (Table Hill, we changed the networks: 
they now have 10 nodes (for scenarios 1, 2, and 4) or 8 nodes 
(for scenarios 3 and 5). All nodes thus store the same number 
of columns, i.e., the number of columns n is divisible by the 
number of nodes P. The model for generating these networks 
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Figure 7. Type II experiments: number of communication steps to reach 10 3 % of accuracy or 3000 communication steps. 



Communication steps 




10° 

2i 2 2 2 3 2 4 2 5 2 6 2 7 2 8 2 9 2 10 

Number of nodes 

Figure 8. Type II experiments for row partition: number of communication 
steps to reach 0.1% of accuracy as a function of the network size. The straight 
lines represent a linear fit. 

is the same as in Table [TTTJ In all experiments we set the 
regularization parameter to 5 = 1CP 3 , a value that always 
allowed the recovery of the solution of BP, as we will see. 

Figure [9] shows the plots with the results of the type II 
experiments. As before, D-ADMM always required less com- 
munication steps to achieve a 10~ 3 % of accuracy. In particular, 
D-ADMM used in average 42% of the communications D- 
Lasso used; the standard deviation was 10%. The largest 
difference in the number of communications occurred in 
scenario 2, network 4, where D-ADMM only used 28% of the 
communications that D-Lasso used. The smallest difference 
was 72% and it occurred in scenario 5, network 5. We mention 
that, in contrast with the row partition, there were cases in 
which the optimal value for p was 10~ 3 and 10 (cf. Table IIVb . 
the "boundary" values of the set of p's we used. Therefore, 
we might improve the results if we try a wider range of p's. 

VI. Final Remarks and Conclusions 

We proposed an algorithm for solving BP in two distributed 
frameworks. In one framework, the BP matrix is partitioned 
by rows, with its rows distributed over a network with an 
arbitrary number of nodes; in the other framework, it is 
the columns of the matrix that are distributed. The only 
requirement on the topology of the network through which the 



nodes communicate is connectivity (and we also assume that 
this topology does not change along the algorithm). Therefore, 
our algorithms can be applied to several scenarios, ranging 
from sensor networks, where the communication network is 
usually sparse, to super-computing platforms, characterized by 
dense networks. 

We simulated our algorithms for several types of data 
and networks and conclude that they always require less 
communications than competing algorithms. This is paramount 
in energy-constrained environments such as sensor networks. 
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where F(X) = M xeX f(x) 
M yeY g(y) + ^ T By. 



X T Ax and G(X) = 



Furthermore, ll58l recently proved that ADMM converges 
with rate 0(1 /k). This rate holds even if the quadratic term 
of (f>\(z) in ( fZSb is linearized, which can many times simplify 
the solution of that optimization problem. For more properties 
of ADMM and its relation to other algorithms see |59], 1 60 1 . 

We now present a generalization of ADMM, which we call 
"generalized ADMM." The generalized ADMM solves: 



minimize J2i=i fi( x i) 
subject to Xi € Xi , i = 1, . 

Si=l AiXi = , 



(28) 



where (a_., . . . ,£_) is the variable, I > 2, the functions fi 
are convex, Aj are full column-rank matrices, and Xi are 
polyhedral sets. The generalized ADMM solves d28T l with: 



„(fc+i) 



arg min fi(xi) + 4> X (k)(AiXi + } A^ 

116X1 c — ' J 

J>1 
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Appendix A 

Alternating Direction Method of Multipliers 

Let / and g be two real-valued convex functions and X 
and Y two polyhedral sets. Let also A and B be two full 
column-rank matrices, and consider the problem 

fix) + g(y) 



„(fc+i) _ 



minimize 

x£X,yeY 

subject to Ax + By = , 



(24) 



= arg min fi(xi) + cf> xm (A { x t + } A 



j X j 



(fc+1) 



3<i 



„(fc+i) _ 



arg min fi(xj) + cb xW (A/x/ + } A 



J X j 



with variables x and y. The alternating direction method of i 

multipliers (ADMM) _3__, __0) solves d___> by applying the = A (fc) + p^A 
method of multipliers ll40l p.408] concatenated with one single i = i 
loop of the nonlinear Gauss-Seidel [40 p. 272] 

x (k+i) 



„(fe+i) 



y 



(fc+i) = 



axgmhif(x)+(j) xW {Ax + By^) (25) 
argmin 5 (y) + <P x(k) (Ax {k+1) + By) (26) 



A (*+i) = A W + p (Ax {k+1) + By^) , 



(27) 



where <f> x (z) = A T z + |||z|| 2 and p > is a tuning parameter. 
In words, the augmented Lagrangian 

L(x, y- A) = f{x) + g(y) + X T (Ax + By) + £\\Ax + By\\ 2 , 

is first minimized with respect to x and then, keeping the value 
of x fixed at the just computed value x^ k+1 \ the augmented 
Lagrangian is minimized with respect to y. Thus, (l25T l and d26l i 
cannot be carried out simultaneously. After these minimization 
steps, the dual variable A is updated in a gradient-based way 
via ([27). The following theorem guarantees its convergence. 

Theorem 4 (___), _2__, El). Let f : R" 1 — > K and g : 

R" 2 ->• M be convex over R™ 1 and R™ 2 , respectively. Let X C 
R" 1 and X C R™ 2 be polyhedral sets and let A and B two 
full column-rank matrices. Assume (I24l > is solvable. Then, 

1) {(x^ , y( k ')} converges to a solution of d24l ); 

2) {\( k ^} converges to a solution of the dual problem 

maximize F(X) + G(X) 
X 



This algorithm is then the natural generalization of (f25l>-(f27l>. 
It is not known yet if Theorem|4]also applies to the generalized 
ADMM. The latest efforts for doing that can be found in |61 1, 
[62], [63 1, 1 64- 1 . In spite of this fact, we apply the generalized 
ADMM in this paper and the resulting algorithm never failed 
to converge in our simulations. 

Appendix B 
Problem for Each Node: Row Partition 

In the distributed algorithm we propose, each node has to 
solve, in each iteration, the problem 

(29) 



minimize + v ' x + c\\x\\ 

subject to Ax = b , 



where x £ R™ is the variable, and v £ W\ c> 0, A £ R mxn , 
and b £ R m are given. We propose to solve d29l ) by solving 
its dual problem: 

maximize A T 6 + \\ x i\ + Ui(X)xi + cxf) 

X 

(30) 

where the dual variable is A £ R m and u(X) = v — A T X. 
To compute the objective of this dual problem for a fixed A, 
we need to find the minimum Xi(X) of the function \xi \ + 
Ui(X)xi + cx\ for i = 1, . . . , n. Each one of these functions 
is strictly convex due to c > 0, and hence it has a unique 
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minimizer Xi(X), It follows from Danskin's theorem |40l 
prop. B25] that the objective of (l30b is differentiable with 
gradient b — Ax(X), where x(X) = (xi(A), . . . , x n (A)). By 
the optimal conditions for convex problems [40, prop.B24], 



The unicity of the minimizers x;(A) also implies that, once 
a solution A* of ( l30l > is known, the solution of d29| i is 
given by a;(A*). To solve i30\ , we propose using the method 
in BP . a very efficient Barzilai-Borwein (BB) algorithm. Per 
iteration, BB consumes 0(n) flops plus the flops necessary 
to compute the gradient. Furthermore, BB is known to con- 
verge i?-superlinearly for generic unconstrained optimization 
problems (§5] Th.4]. 

As a final note, the number of iterations to solve d29b can 
be drastically reduced by using warm-starts. This means that, 
at iteration k + 1, node p will solve ( f29T > by starting the BB 
algorithm with the solution found in iteration k. The solutions 
of these two consecutive problems are expected to be close, 
since only v and c changed, possibly just by a small quantity. 




